How often do you read The Student Life?
a. Every day
b. 3-5 times a week
c. Once a week
d. Rarely
Reading The Student Life
What do you think is the most common word in the titles of The Student Life opinion pieces?
Analyzing The Student Life
Reading The Student Life
How do you think the sentiments in opinion pieces in The Student Life compare across authors?
Roughly the same?
Wildly different?
Somewhere in between?
Analyzing The Student Life
All of this analysis is done in R!
(mostly) with tools you already know!
Common words in The Student Life titles
Code for the earlier plot:
data(stop_words) # from tidytext

tsl_opinion_titles |>
  tidytext::unnest_tokens(word, title) |>
  anti_join(stop_words) |>
  count(word, sort = TRUE) |>
  slice_head(n = 20) |>
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(y = word, x = n, fill = log(n))) +
  geom_col(show.legend = FALSE) +
  theme_minimal(base_size = 16) +
  labs(
    x = "Number of mentions",
    y = "Word",
    title = "The Student Life - Opinion pieces",
    subtitle = "Common words in the 500 most recent opinion pieces",
    caption = "Source: Data scraped from The Student Life on November 2, 2025"
  ) +
  theme(
    plot.title.position = "plot",
    plot.caption = element_text(color = "gray30")
  )
Avg sentiment scores of first paragraph
Code for the earlier plot:
afinn_sentiments <- get_sentiments("afinn") # need tidytext and textdata

tsl_opinion_titles |>
  tidytext::unnest_tokens(word, first_p) |>
  anti_join(stop_words) |>
  left_join(afinn_sentiments) |>
  group_by(authors, title) |>
  summarize(total_sentiment = sum(value, na.rm = TRUE), .groups = "drop") |>
  group_by(authors) |>
  summarize(
    n_articles = n(),
    avg_sentiment = mean(total_sentiment, na.rm = TRUE),
  ) |>
  filter(n_articles > 1 & !is.na(authors)) |>
  arrange(desc(avg_sentiment)) |>
  slice(c(1:10, 69:78)) |>
  mutate(
    authors = fct_reorder(authors, avg_sentiment),
    neg_pos = if_else(avg_sentiment < 0, "neg", "pos"),
    label_position = if_else(neg_pos == "neg", 0.25, -0.25)
  ) |>
  ggplot(aes(y = authors, x = avg_sentiment)) +
  geom_col(aes(fill = neg_pos), show.legend = FALSE) +
  geom_text(
    aes(x = label_position, label = authors, color = neg_pos),
    hjust = c(rep(1, 10), rep(0, 10)),
    show.legend = FALSE,
    fontface = "bold"
  ) +
  geom_text(
    aes(label = round(avg_sentiment, 1)),
    hjust = c(rep(1.25, 10), rep(-0.25, 10)),
    color = "white",
    fontface = "bold"
  ) +
  scale_fill_manual(values = c("neg" = "#4d4009", "pos" = "#FF4B91")) +
  scale_color_manual(values = c("neg" = "#4d4009", "pos" = "#FF4B91")) +
  scale_x_continuous(breaks = -5:5, minor_breaks = NULL) +
  scale_y_discrete(breaks = NULL) +
  coord_cartesian(xlim = c(-5, 5)) +
  labs(
    x = "negative ← Average sentiment score (AFINN) → positive",
    y = NULL,
    title = "The Student Life - Opinion pieces\nAverage sentiment scores of first paragraph by author",
    subtitle = "Top 10 average positive and negative scores",
    caption = "Source: Data scraped from The Student Life on November 2, 2025"
  ) +
  theme_void(base_size = 16) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, margin = unit(c(0.5, 0, 1, 0), "lines")),
    axis.text.y = element_blank(),
    plot.caption = element_text(color = "gray30")
  )
# A tibble: 500 × 4
title authors date first_p
<chr> <chr> <dttm> <chr>
1 Stop buying your books Sarah … 2025-04-04 08:03:00 from b…
2 The case for fleeing the count… Alex B… 2025-04-04 07:27:00 when t…
3 Tolerate thy neighbor Parker… 2025-04-04 07:22:00 it’s s…
4 Confronting furry hate Xavier… 2025-04-04 07:16:00 furrie…
5 Shame on the governor: Gavin N… Akshay… 2025-03-28 06:56:00 gavin …
6 Accessibility at the 5Cs requi… Zena A… 2025-03-28 06:42:00 althou…
7 Your spring break destination … Nicole… 2025-03-15 04:44:00 spring…
8 Pomona College’s Merritt Field… Katie … 2025-03-15 03:03:00 with l…
9 Seminars should be tech-free s… Elias … 2025-03-14 09:15:00 we hav…
10 The bitter truth to the bitter… Daniel… 2025-03-14 09:13:00 have y…
# ℹ 490 more rows
Web scraping
Scraping the web: what? why?
An increasing amount of data is available on the web
These data are provided in an unstructured format: you can always copy & paste, but it’s time-consuming and prone to errors
Web scraping is the process of extracting information automatically and transforming it into a structured dataset
Two different scenarios:
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
Hypertext Markup Language
Much of the data on the web is available as HTML. While it is structured (hierarchical), it is often not immediately available in a form useful for analysis (flat / tidy).
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>
Some HTML elements
<html>: start of the HTML page
<head>: header information (metadata about the page)
<body>: everything that is on the page
<p>: paragraphs
<b>: bold
<table>: table
<div>: a container to group content together
<a>: the “anchor” element that creates a hyperlink
HTML attribute
An attribute in HTML is a name–value pair that gives extra information about an element. It sits inside the opening tag and modifies the element’s behavior, appearance, identity, or data.
Think of the attribute as the argument to the element (which would be the function in this analogy).
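To make the name–value idea concrete, here is a minimal sketch that reads the attributes of one element from the example page. It assumes rvest is installed; the variable names (page, first_div, and so on) are illustrative only.

```r
library(rvest)

# One element from the example page, parsed from an inline string
# (no network access needed)
page <- read_html('<div class="name" id="first">John</div>')

first_div <- page |> html_element("div")

# All attributes of the element, as a named character vector
all_attrs <- html_attrs(first_div)

# A single attribute, looked up by its name
id_value <- html_attr(first_div, "id")
class_value <- html_attr(first_div, "class")

print(all_attrs)
print(id_value)
```

Here class and id are the names, and "name" and "first" are the values attached to the <div> element.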
rvest
The rvest package makes basic processing and manipulation of HTML data straightforward
It is designed to work with pipelines built with |>
We will use a tool called SelectorGadget to help us identify the HTML elements of interest by constructing a CSS selector which can be used to subset the HTML document.
Some examples of basic selector syntax are below:
Selector             Example          Description
.class               .title           Select all elements with class="title"
#id                  #name            Select all elements with id="name"
element              p                Select all <p> elements
element element      div p            Select all <p> elements inside a <div> element
element>element      div > p          Select all <p> elements with <div> as a direct parent
[attribute]          [class]          Select all elements with a class attribute
[attribute=value]    [class=title]    Select all elements with class="title"
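As a rough illustration of the selector syntax, the sketch below applies a few of these selectors to a shortened copy of the earlier example page. It assumes rvest is installed; the variable names are illustrative.

```r
library(rvest)

# Shortened version of the example page, parsed from an inline string
page <- read_html('
<html>
  <body>
    <p align="center">Hello world!</p>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="work">555-555-9999</div>
    </div>
  </body>
</html>')

# .class: every element with class="name"
names_txt <- page |> html_elements(".name") |> html_text2()

# #id: the single element with id="last"
last_txt <- page |> html_element("#last") |> html_text2()

# element element: <div> elements nested anywhere inside another <div>
nested_txt <- page |> html_elements("div div") |> html_text2()

# [attribute=value]: elements whose class attribute equals "work"
work_txt <- page |> html_elements("[class=work]") |> html_text2()

print(names_txt)
print(nested_txt)
```

Note the difference between html_element(), which returns the first match, and html_elements(), which returns all matches.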
CSS classes and ids
class and id are used to style elements (e.g., change their color!). They are special types of attributes.
class can be applied to multiple different elements (class is identified with ., for example .name)
id is unique to each element (id is identified with #, for example, #first)
html <- read_html("<p> Hello,\n world! </p>")
html |> html_element("p") |> html_text()
[1] " Hello,\n world! "
html |> html_element("p") |> html_text2()
[1] "Hello, world!"
Text with html_text() vs. html_text2()
html <- read_html(
  "<p>
  This is the first sentence in the paragraph.
  This is the second sentence that should be on the same line as the first sentence.<br>This third sentence should start on a new line.
  </p>"
)
html |> html_text()
[1] " \n This is the first sentence in the paragraph.\n This is the second sentence that should be on the same line as the first sentence.This third sentence should start on a new line.\n "
html |> html_text2()
[1] "This is the first sentence in the paragraph. This is the second sentence that should be on the same line as the first sentence.\nThis third sentence should start on a new line."
html_attr() always returns a string, so if you’re extracting numbers or dates, you’ll need to do some post-processing.
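A minimal sketch of that post-processing step follows. The markup, including the data-views and data-posted attributes, is made up for illustration and does not come from any real page; it assumes rvest is installed.

```r
library(rvest)

# Hypothetical markup: attribute names and values are invented for this example
page <- read_html('
<ul>
  <li data-views="120" data-posted="2025-10-30">First article</li>
  <li data-views="45" data-posted="2025-10-24">Second article</li>
</ul>')

items <- page |> html_elements("li")

# html_attr() gives back character vectors, even for number-like values
views_chr <- items |> html_attr("data-views")

# Convert explicitly to the types needed for analysis
views <- as.numeric(views_chr)
posted <- as.Date(html_attr(items, "data-posted"))

print(class(views_chr))
print(views)
print(posted)
```

If an attribute is missing from an element, html_attr() returns NA for it by default, so the converted vectors keep one entry per element.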
div p vs div > p
div p selects all <p> elements within <div>, regardless of depth.
div > p selects only direct child <p> elements of <div>.
<div>
<p>This will be selected by both `div p` and `div > p`.</p>
<section>
<p>This will be selected only by `div p`, not by `div > p`, because it is inside the section tag.</p>
</section>
</div>
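The difference can be checked directly with rvest. This is a small sketch on an inline copy of the structure above, with shortened paragraph text; it assumes rvest is installed.

```r
library(rvest)

# Same nesting as the snippet above, with shorter paragraph text
page <- read_html('
<div>
  <p>Direct child</p>
  <section>
    <p>Nested inside section</p>
  </section>
</div>')

# Descendant selector: any <p> inside the <div>, at any depth
descendants <- page |> html_elements("div p") |> html_text2()

# Child selector: only <p> elements whose parent is the <div>
children <- page |> html_elements("div > p") |> html_text2()

print(descendants)
print(children)
```

The descendant selector returns both paragraphs, while the child selector returns only the first.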
SelectorGadget
SelectorGadget (selectorgadget.com) is a JavaScript-based tool that helps you interactively build an appropriate CSS selector for the content you are interested in.
Recap
Use SelectorGadget to identify the elements you want to grab
Use the rvest R package to first read in the entire page (into R) and then parse the object you’ve read in to extract the elements you’re interested in
Put the components together in a data frame (a tibble) and analyze it like you analyze any other data
Plan
Read in the entire page
Scrape opinion title and save as title
Scrape author and save as author
Scrape date and save as date
Create a new data frame called tsl_opinion with variables title, author, and date
title <- tsl_page |>
  html_elements(".entry-title a") |>
  html_text()
title
[1] "OPINION: Aesthetic feminism is super anti-feminist"
[2] "OPINION: Trump’s authoritarianism doesn’t listen to your No Kings Day cardboard signs"
[3] "OPINION: Chosen family is not enough, we must be blood brothers"
[4] "OPINION: Why San Quentin’s reform model doesn’t do the trick"
[5] "OPINION: Under media oligarchy, TikTok isn’t the problem"
[6] "OPINION: Trump’s bigotry drove Latino conservatism"
[7] "OPINION: LA can’t keep neglecting downtown; Convention Center expansion isn’t the answer"
[8] "OPINION: Use a condom before it’s too late"
[9] "OPINION: On Poppin’ and Lockin’: How I developed a breakdancing addiction"
[10] "OPINION: Celebrities can lose weight and still preach body positivity"
title <- title |>
  str_remove("OPINION: ")
title
[1] "Aesthetic feminism is super anti-feminist"
[2] "Trump’s authoritarianism doesn’t listen to your No Kings Day cardboard signs"
[3] "Chosen family is not enough, we must be blood brothers"
[4] "Why San Quentin’s reform model doesn’t do the trick"
[5] "Under media oligarchy, TikTok isn’t the problem"
[6] "Trump’s bigotry drove Latino conservatism"
[7] "LA can’t keep neglecting downtown; Convention Center expansion isn’t the answer"
[8] "Use a condom before it’s too late"
[9] "On Poppin’ and Lockin’: How I developed a breakdancing addiction"
[10] "Celebrities can lose weight and still preach body positivity"
# A tibble: 10 × 3
title author date
<chr> <chr> <dttm>
1 Aesthetic feminism is super anti-femini… Ansle… 2025-10-30 19:48:00
2 Trump’s authoritarianism doesn’t listen… Jason… 2025-10-30 19:22:00
3 Chosen family is not enough, we must be… Alex … 2025-10-30 19:22:00
4 Why San Quentin’s reform model doesn’t … Leili… 2025-10-30 19:22:00
5 Under media oligarchy, TikTok isn’t the… Nicol… 2025-10-24 01:49:00
6 Trump’s bigotry drove Latino conservati… Rafae… 2025-10-24 01:34:00
7 LA can’t keep neglecting downtown; Conv… Nicho… 2025-10-24 01:29:00
8 Use a condom before it’s too late Alex … 2025-10-24 01:13:00
9 On Poppin’ and Lockin’: How I developed… Leili… 2025-10-10 02:47:00
10 Celebrities can lose weight and still p… Joell… 2025-10-10 02:41:00
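The whole plan (read the page, scrape each field, combine into a data frame) can be sketched end to end on a tiny inline page. This is an illustration under stated assumptions, not the code used for the slides: only the ".entry-title a" selector comes from the steps above, while the ".author" and ".date" selectors, the markup, and the placeholder headlines and names are hypothetical. On the real site the selectors would come from SelectorGadget. The sketch also uses base R's sub() in place of str_remove() to keep dependencies minimal.

```r
library(rvest)

# Stand-in for the real page so the example runs offline.
# ".author" and ".date" are hypothetical class names.
tsl_page <- read_html('
<div class="post">
  <h2 class="entry-title"><a href="#">OPINION: First placeholder headline</a></h2>
  <span class="author">Author One</span>
  <span class="date">2025-10-30 19:48</span>
</div>
<div class="post">
  <h2 class="entry-title"><a href="#">OPINION: Second placeholder headline</a></h2>
  <span class="author">Author Two</span>
  <span class="date">2025-10-24 01:49</span>
</div>')

# Scrape the title and drop the "OPINION: " prefix
title <- tsl_page |> html_elements(".entry-title a") |> html_text2()
title <- sub("^OPINION: ", "", title)

# Scrape the author (hypothetical selector)
author <- tsl_page |> html_elements(".author") |> html_text2()

# Scrape the date as text, then convert it to a date-time
date_chr <- tsl_page |> html_elements(".date") |> html_text2()
date <- as.POSIXct(date_chr, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Combine the pieces into one data frame (a tibble)
tsl_opinion <- tibble::tibble(title, author, date)
print(tsl_opinion)
```

Building each column as its own vector first makes it easy to confirm that all three have the same length before combining them.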